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Due to the very long timescales involved {fj,s—s), theoretical modeling of funda- 
mental biological processes including folding, misfolding, and mechanical unraveling 
of biomolecules, under physiologically relevant conditions, is challenging even for 
distributed computing systems. Graphics Processing Units (GPUs) are emerging as 
an alternative programming platform to the more traditional CPUs as they provide 
high raw computational power that can be utilized in a wide range of scientific ap- 
plications. Using a coarse-grained Self Organized Polymer (SOP) model, we have 
developed and tested the GPU-based implementation of Langevin simulations for 
proteins (SOP-GPU program). Simultaneous calculation of forces for all particles 
is implemented using either the particle based or the interacting pair based paral- 
lelization, which leads to a ~30-fold acceleration compared to an optimized CPU 
version of the program. We assess the computational performance of an end-to-end 
application of the SOP-GPU program, where all steps of the algorithm are running 
on the GPU, by profiling the associated simulation time and memory usage for a 
number of small proteins, long protein fibers, and large-size protein assemblies. The 
SOP-GPU package can now be used in the theoretical exploration of the mechanical 
properties of large-size protein systems to generate the force-extension and force- 
indentation profiles under the experimental conditions of force application, and to 
relate the results of single-molecule experiments in vitro and in silico. 
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I. INTRODUCTION 



Mechanical functions of protein fibers such as fibronectin, fibrin fibers, microtubules, and 



actin filaments, are important in cytoskeletal support and cell motility 



n 



the formation of the extracellular matrix j4 
of viral capsids of plant and animal viruses 
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^|, in cell adhesion and 
10|. Physical properties 
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and in blood clotting [8 
ll|-13|, retroviruses jl^, and bacteriophages 
and the transitions between their stable and unstable states determine the life cycle of many 
viruses, including virus maturation, and infection of cells l^]. Understanding the microscopic 
origin of the unique viscoelastic properties of protein fibers and the crossover from an elastic 
to a plastic behavior in viral capsids, as well as the control of their mechanical response to 
an applied mechanical force constitute major areas of research in biochemistry and biophysics. 



Single-molecule techniques, such as AFM and laser tweezer 
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Dased force spectroscopy, have been 



2l| and viral capsids 
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used to study the mechanical properties of protein fibers 
Yet, due to the high complexity of these systems (~10'^— 10^ residues) and to their large size 
(~ 50 — 200nm), these experiments yield results that are nearly impossible to interpret without 
first having some a priori information about their energy landscape jsl. 

Standard packages for all-atom Molecular Dynamics (MD) simulations, such as CHARMM 



24l |. NAMD 25[, and Gromacs [26[ among others, are being used to access the submolecular 
behaviour of biomolecules. However, because all-atomic modeling is currently limited to a 
10— 50nm length scale and 0.1 — 10/is duration 
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29| . these methods allow for the theoretical 



exploration of equilibrium properties of biomolecules, and reaching the biologically important 
ms—s timescale becomes virtually impossible even for a small system. More importantly, to 
fully explore the free energy landscape underlying a biological process of interest, one needs to 
generate a statistically representative set of trajectories. One possibility is to carry out MD 
simulations on manycore computer clusters, but it requires tremendous computational resources 
and long CPU times. For example, it takes 800, 000 CPU hours to obtain 20 short Ins MD 
trajectories for the southern bean mosaic virus, which contains as many as 4.5 million atoms, 
on an SGI Altix 4700 cluster 30|. These limitations exclude computations as an investigative 
tool in the study of a range of biological problems, such as the large deformations of protein 
fibers, the formation of biomolecular complexes and aggregates, and the mechanical failure of 
viral capsids, for which experimental data are already available, thereby rendering the direct 
comparison of the results of experiments in vitro and in silico impossible. 

Although graphics processors have been originally designed for computationally intensive 
graphics rendering, they have evolved over the last few years into highly parallel, multithreaded 
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computing devices. Recent technological advances in the throughput-oriented hardware archi- 
tecture of GPUs with extremely high peak arithmetic performance, which employs IEEE floating 
point arithmetic, have unleashed tremendous computational power that can now be utilized in 
general purpose scientific applications. Unlike mainstream processor architecture, GPUs devote 
the majority of their logic units to performing actual calculations, rather than to cache memory 
and flow control. Massive multithreading, fast context switching, and high memory bandwidth 
have enabled GPUs to tolerate latencies associated with memory calls, and to run many com- 
putational cores simultaneously (Fig. 1). Programming tools for modern GPUs include several 



platforms such as ATI Stream Computing 



(CUBA) |32 



3l| . NVIDIA Compute Unified Device Architecture 



33| . and Open Computing Language (OpenCL) [34]. CUDA, a parallel comput- 



ing environment (a dialect of the C and C++ programming languages), provides a high level 
software platform that allows a programer to define kernels that are executed in parallel by 
independent computational threads. The cost of one GeForce GTX 295 graphics card (from 
NVIDIA) with 2 GPUs (2x240 processors) is just ~$500 (i.e. ~$1 per processor), which makes 
GPUs a cheap desktop-based alternative to an expensive computer cluster. 

Because GPUs differ from CPUs in several fundamental ways, CPU-based methods for molec- 
ular simulations of biomolecules cannot be easily translated nor simply adapted to a GPU. In 
these methods, particle-particle interactions are described by the same empirical potential energy 
function (force field), and the dynamics of the system in question is obtained by solving numer- 
ically the same equations of motion for all particles. Hence, there is a direct correspondence 
between the SIMD (Single Instruction Multiple Data) architecture of a GPU at the hardware 
level and the numerical routines (software) used to follow the molecular dynamics. It is then 
possible to execute "single instruction", i.e. calculation of the potential energy or evaluation of 
forces, or generation of random forces, or integration of the equations of motion, on "multiple 
data" sets (for all particles) at the same time over many iterations using many Arithmetic Logic 
Units (ALUs) running in parallel. This makes MD simulations a natural candidate to imple- 
mentation on a GPU, but not all algorithms are amenable to this architecture. For an algorithm 
to execute efficiently on the GPU, it must be recast into a data-parallel form with independent 
threads running the same instruction stream on different data. There exist preliminary versions 
of standard packages for MD simulations of proteins implemented on a GPU, such as NAMD 



28|, 



35 



361], Gromacs 37|, and other applications 38|-|40|. Yet, there are no CPU-based imple- 



mentations of Langevin dynamics simulations that we are aware of. In this paper, we develop 
and test such an implementation. Because the topology and the overall structure (geometry), 
rather than the atomic details, govern the force-driven molecular transitions in protein systems. 



4 



we employ a coarse-grained description of proteins 4lN43| using a Self Organized Polymer (SOP) 



model 



44 



45|. 



Implementing the Langevin dynamics algorithm on a GPU requires detailed understanding of 
the device architecture. In the next Section, we review key architectural traits of modern CPUs. 
The methodology for the GPU-based implementation of Langevin simulations is presented in 
Section 3, where we describe the particle based and the interacting pair based parallelization 
approaches to force computation as well as the numerical routines for generating pseudorandom 
numbers, constructing the Verlet lists, and integrating forward Langevin equations to the next 
time step. For purposes of presentation and to focus on the most essential computational aspects, 
we have simplified the prosentation of the formalism as much as possible. A comparative analysis 
of the results of CPU- and GPU-based simulations of the mechanical unfolding for a test system 
represented by the all-/3-strand WW domain is performed in Section 4, where we also assess the 
accuracy of the numerical integration. We discuss the results of the GPU-based computations 
in terms of the simulation time, memory usage, and computational speedup (CPU time versus 
GPU time) for a range of proteins including small proteins, such as the WW domain, the Ig27 
domain from human titin, the C2A domain from human synaptotagmin (Sytl), the 7C chain 
and the double- -D fragment of human fibrinogen [Fb), single-chain models of fibrin fibers {Fb 
monomer and dimer), and large-size protein assembly (viral capsid HK97). The main results 
are summarized in Section 5. 



II. NVIDIA GPU ARCHITECTURE 

The generally used CPUs have most of their logical elements dedicated to cache and flow 
control to employ complex computational logic and to provide computational cores with fast 
memory access (Fig. 1). This makes a CPU capable of performing computations following a 
sequential workflow. On a GPU, a large number of logical elements are devoted to actual com- 
putations, and cache and flow control are reduced to a small unit. These features enable CPUs 
to achieve high arithmetic unit density and to perform the same computational procedure (s) 
simultaneously for all particles in a system by using many independent threads of execution 



to run parallel calculations on different data sets (Fig. 1) [32|, |33|. For example, on a CPU, 
vector addition in M dimensions is performed in a loop, where the components of the resulting 
vector are computed one after another. On a GPU, this procedure can be performed using M 
independent threads, each of which computes just one component. Hence, on a GPU, a vector 
sum is computed at a cost of one addition, whereas on a CPU this cost is multiplied by M. 
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Yet, unlike computer clusters, where each core is capable of following its own computational 
protocol, contemporary compute-oriented GPUs are based on the SIMD architecture, which 
mandates that an identical instruction stream be fed to the large array of processing units 32| . 
Hence, to achieve top performance on the GPU, one has to organize a computational task into 
a data-parallel form with many independent threads performing the same operation(s), but on 
different data. In addition, the task should be compute-intensive so that, most of the time, the 
GPU performs actual computations rather than reading and writing data 33|. 



A GPU contains several multiprocessors, each having its own flow control and cache unit. 
Modern GPUs from NVIDIA (Tesla C1060, GeForce GTX 285) host as many as 30 multiproces- 
sors, each with 8 ALUs (Fig. 1) 32|, which is a total of 240 computational units per chip. The 
GPU sits on an expansion board with its own (global or device) memory - 512MB on the more 
conventional GeForce cards and up to 4GB on Tesla cards - separate from the DRAM acces- 
sible by the CPU. Although the GPU memory systems provide bandwidths of over 100GB /s, 
GPU computational cores have a low amount of cache memory, which necessitates the use of a 
coalesced global memory read. Because all the 240 ALUs access the global memory simultane- 
ously, the number of memory invocations (per ALU) should be minimal to optimize the GPU 
performance. For example, on graphics cards from NVIDIA, the size of the cache memory is 
r^20KB per multiprocessor, which is sufficient to store the local data. However, when an ALU 
attempts to large window of addresses stored in an on-board memory, the cache memory 

may run low causing computational cores to spend more time waiting for data (latency), rather 
than performing computations. With a sufficient number of threads multiplexed on each ALU, 
latency is effectively hidden and the GPU device achieves a high performance level without the 
need for a cache. For example, 30, 000 threads of execution may run concurrently on the Tesla 
C1060 graphics card with a peak arithmetic rate of about 900 single precision gigaflops. Of the 
devices currently available, graphics processors produced by NVIDIA offer the most advanced as 
well as the most user-friendly environment. For this reason, we used the graphics card GeForce 
GTX 295 (NVIDIA), which has two GPUs, each with 240 1.2AGHz ALUs (30 multiprocessors) 
and 768Mi? of memory. Each multiprocessor has 16KB of cache, of which 8KB is the constant 
memory cache and 8KB is the texture memory cache (these numbers vary with GPU make and 
model). 

Unlike programming on a CPU, data driven programming on a GPU entails finding a way to 
partition the computational problem into many identical subroutines, to improve the memory 
access pattern, and then to optimize the numerical procedures themselves. At a software level, 
a programmer can define scalar program fragments, called kernels, that are executed concur- 
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rently in a thread block in parallel warps of 32 threads. A thread, an elementary unit of the 
computational workflow identified by a thread index, uses data to execute instructions that are 
listed in a source code, and saves the result (s) to a unique location to avoid memory conflicts. 
In a single multiprocessor, 32—512 threads can be combined into a thread block. Each thread 
block executes a stream of instructions within a single multiprocessor and many independent 
blocks may be run concurrently across the entire GPU device in a grid. The software model 
of the GPU also includes global, shared, and local memory 32j. Threads use their own local 



memory to store temporary variables, and threads in a block can communicate through the fast 
shared memory in the same multiprocessor, but threads in different blocks cannot communicate 
without using the global memory, which is particularly costly in terms of clock cycles required. 
Although threads in the same block can be synchronized, there is no means of synchronization of 
threads across several blocks. A particular global memory region (software), which corresponds 
to the GPU on-card memory (hardware), can be accessed by all the threads running on a GPU 
and by a CPU, but global memory calls should be cached or combined due to latency. Minimiz- 
ing the number of memory calls can be achieved by using the shared memory, data exchange 
among threads in one thread block, and an optimal alignment of the computational threads 



with the data arrays in the global memory {coalescent memory access) 32|, |33[. Caching the 



global memory can be realized by employing texture references or constant memory. 

III. LANGEVIN DYNAMICS SIMULATIONS ON A GPU 

In this Section, we describe the particle based and the interacting pair based methods for 
the parallel computation of potentials and forces due to binary particle-particle interactions. 
We also outline the numerical procedures involved in the generation of the Verlet lists and 
the random forces, and in the numerical integration of the Langevin equations of motion. We 
designed the algorithm to process as many computational fragments simultaneously on a GPU as 
possible in order to move the GPU into full production efficiency and to minimize the GPU / CPU 
communication. The algorithm decomposition, along with the workload division between the 
GPU and the CPU, is diagrammed in Fig. 2, where we also summarize the computational 
workflow on a CPU and on a GPU, including the operation on the data files for the molecular 
topology, the particle energies and coordinates, and the data flow between the CPU DRAM and 
the GPU global memory (host-CPU data transfers). 

In Langevin simulations of biomolecules, molecular forces are usually described by two-body 



(pair) potentials, such as the harmonic potential, the FENE potential 46|, the Lennard- Jones 
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potential, etc., which differ in their mathematical form. Hence, the same generic algorithm can 
be employed to compute these potentials. In the pseudocode listings below, the i-th residue for 
a protein of residues has a unique index i G [0, — 1]; for an array of particle coordinates 
f, f[i] denotes coordinates of the i-th particle. We also use the following notations: global 
memory reads are represented by <^=, saving data to the GPU global memory is shown by 
^ denotes cached texture memory reads, and ^ represents local, shared, or constant memory 
invocation, or assignments of variables. Consider the following computational procedure for 
evaluating forces on a CPU: 

Algorithm 1: Calculation of pairwise forces (CPU implementation). 

1. P <r- total number of pairs in a system for a given potential 

2. for p = to P - 1 do 

3. i pairs[p].i {index of the first particle in a particle pair} 

4. j ^ pairs[p].j {index of the second particle in a particle pair} 

5. par ^ pairs[p].parameters {parameters for the i—j pair} 

6. r ^ r[i] — f[j] 

7. df ^ force (r, par) 

8. m ^ m ± df 

9. m ^ m T df 

10. end for 

Information about the residue pairs is stored in array pairs, which has indices i and j, and 
variable par for constant parameters, which specify the potential energy. In the main cycle on 
a CPU (line 2), index p runs through all the pairs of interacting residues p=0, 1, . . . , P — 1. For 
each pair, the information about the coordinates (r[z] and r[j]) and the constant parameters 
(par) is gathered in lines 3—5. The force increment df is computed in force ( . . . ) for a given 
pair p (line 7). This value is added (or subtracted) to (from) the force for the i-th particle {f[i], 
line 8) and subtracted (or added) from (to) the force for the j-th (/[j], line 9). On a CPU, 
the force values are computed only once (lines 7—9). Because force calculations are sequential, 
they do not overlap in time. By contrast, because on a GPU a two-body potential is computed 
for different pairs of residues in different threads, a naive addition (lines 8 and 9) may cause 
memory conflicts when some or all the threads attemp to access the same address in the GPU 
global memory. 

There are two main optimization strategies that allow one to avoid this situation. In the first 
approach, all the forces for one particle are computed in one thread, which requires running 



8 



threads to obtain the force values for all particles. We refer to this procedure as the particle 
based parallelization approach. The use of this approach results in the same force, acting on 
the i-th and j-th particles, but computed twice in the i-th and j-th threads 40|. Following a 
different strategy, which we refer to as the interacting pair based parallelization approach, force 
calculations are performed for all pairs in parallel using P independent theads, and 2P force 
values are saved to different locations in the GPU global memory. We pursued in detail both 
optimization strategies, which exploit the data-parallel aspects of GPU based computing. 



A. The particle based parallelization approach 

In this approach, independent threads run on a GPU concurrently, each computing all 
the pair potentials for each particle and summing all force values (except for the random force) 
to obtain the total force. Although the force acting on the i-th and j-th particles is computed 
twice, the number of global memory calls is reduced by a factor of 2N, and the time spent on 
recalculating the same potential is compensated by the time saved by not waiting to write the 
force data to and to read the data from the GPU global memory. 
Algorithm 2: Calculation of pairwise forces using particle based parallelization. 

1. fi^O {resulting force} 

2. i ^ GPU thread index {same as particle index} 

3. fi 1= r[i] {coordinates of the i-th particle} 

4. Pi Pp[i] {number of pairs formed by the i-th particle} 

5. for p = to Pi - 1 do 

6. j <^ PairsMap[i][p].j {second particle in a pair} 

7. fj 1= r[j] {coordinates of the j-th particle} 

8. par <^ P air sMap[i][j]. parameters {parameters for the i—j pair} 

9. f fi — fj 

10. df ^ force (r, par) 

11. fi^ U± df 

12. end for 

13. Output: fi 

The array Pp keeping track of the number of pairs for all residues and the matrix PairsMap 
of all particle pairs are pre-generated on a CPU and fetched to the GPU global memory. Pp 
is an A^-dimensional vector of integers 0, 1, 2, . . .. Each element of this vector corresponds to a 
single particle, and the i-th integer value is the number of particles interacting with the i-th 
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particle. PairsMap is the NxM matrix, where M is a maximum number from array Pp. The 
i-th row of the matrix PairsMap corresponds to the i-th particle and contains the indices of 
all particles interacting with the i-th particle and the constant parameters for the potential 
energy function. Data in the PairsMap can be easily re-arranged for coalescent memory reads. 
Coordinates of the second particle for each pair are accessed at random, which allows one to 
take advantage of the texture reference in global memory reads. Force values are computed in 
parallel by threads as follows. First, the i-th thread reads the coordinates fi using a cached 
texture reference (line 3) and the number of particle pairs Pp[i] (line 4). By cycling through all 
the pairs, formed by the i residue (lines 5—12), the thread reads the index j and the coordinates 
fj of the j-th particle and the constant parameters {par, lines 6—8). These are used to compute 
the force increment df (line 10), which is added to the resulting force / (line 11) for the i-th 
particle. The parameters par depend on the potential used. For example, for a covalent bond, 
described by a harmonic potential VH=KlJ{rij — r° )^/2, par contains the equilibrium distance 
r°- and the spring constant K^J . 

Verlet lists: In molecular simulations, the information about the covalent bonds and the native 
interactions (array Pp and matrix PairsMap), obtained from the PDB structure of a protein, 
does not change. However, the information about nonbonded long-range interactions, describing 
the gradual attraction and hardcore repulsion between pairs of atoms, needs to be updated from 
time to time. This is the most computationally demanding component of the algorithm, since 
the complexity of the calculation is 0{N'^). A common approach is to take advantage of the fact 
that long-range interactions vanish over some distance. This allows one to use pair lists that 

n 

include pairs of particles that are closer than the cutoff distance (Verlet lists) [47| . In the particle 
based parallelization approach, the array Pp and the matrix PairsMap have to be regenerated 
on a GPU in order to accelerate the computation of the potential energy using Verlet lists. This 
can be done by rearranging the pseudocode for particle based parallelization (Algorithm 2): 
Algorithm 3: Calculation of forces using particle based parallelization and Verlet lists. 

1. i GPU thread index {same as particle index} 

2. Pi ^ {counter of residue pairs in Verlet list} 
2,. r[i\ {coordinates of the i-th particle} 

4. Pp j -Ppp[pi] {number of all pairs for the i-th particle} 

5. for Pp = to Pp^i — 1 do 

6. j PossihlePairsMap[i\[pj^ {second j'-th particle in a pair} 

7. fj 1= r[j] {coordinates of the j-th particle} 
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11. 



10. 



8, 



9. 



r \r[i] — r[j]\ 

if r < cutoff then 

PossiblePairsMap[i][pp] =^ PairsMap[i][pi] 

Pi ^ Pi + l 



12. 



end if 



13. end for 

14. Pi Pp\i] 

Cycling over pp includes all possible residue pairs to identify pairs that are within the cutoff 
distance. The interparticle distances are computed in line 8. The number of particles Pi that are 
within the cutoff distance is counted (line 11), and a newly found pair is added to the matrix 
PairsMap at the (i,pj)-position, i.e. copied from the matrix PossiblePairsMap (the map of 
all pairs) to the matrix PairsMap (the map of pairs). Once the cycle is completed, pi is saved 
to the array Pp[i], which stores the numbers of the residue pairs in the Verlet list. 



To avoid computing the two-body potentials on a GPU twice, one can design a different 
computational algorithm, where each thread calculates a single pair potential for two coupled 
residues. Then, forces acting on the interacting particles in opposite directions are computed 
only once, the force values obtained are saved to different locations in the GPU global memory, 
and all the forces exerted on each particle are summed up to obtain the total force. This 
approach requires additional memory calls and a gathering subroutine for the force summation, 
but it enables one to accelerate simulations when the number of residues is of the same 
order of magnitude as the number of ALUs, and/or when the computation of pair potentials is 
expensive. In the following pseudocode for the force calculation, P (number of threads) is equal 
to the number of interacting pairs for just one potential energy term, and each thread computes 
forces for one pair of residues: 

Algorithm 4: Calculation of pairwise forces using interacting pair based parallelization. 

1. p ^ GPU thread index {same as pair index} 

2. pair <^ pairs[p] 

3. par <^= PairsParameters[p] {parameters for one pair of residues p} 
A. i pair.i {the i-th particle in the pair} 

5. j ^ pair.] {the j-th particle in the pair} 

6. shifti <— pair.shifti {position in array of forces for the i-th particle} 



B. The interacting pair based parallelization approach 



11 



7. shiftj pair.shiftj {position in array of forces for the j-th particle} 

8. fi ^ f[i] {coordinates of the i-th particle} 

9. fj 1= f[j] {coordinates of the j-th particle} 

10. f i— fi — fj 

11. df ^ force (r, par) 

12. ±df =^ F[i][shifti] {saving force for the i-th particle} 

13. =F(i/ =^ F[j][shiftj] {saving force for the j-th particle} 

Each thread identifies one pair potential using the thread index p, and reads the information 
about the potential from vectors pairs and Pairs Parameters about the constant parameters 
(par), the particle identity {i and j) and particle coordinates (r, and fj), and the global memory 
addresses for saving the force values {shifti and shiftj). In the array F, the force values are 
saved to the position defined by the particle index i or j (i-th or j-th row) and by the parameters 
shifti or shiftj (columns). In the array pairs, each output position i and shifti for the i-th 
particle has to be unique so that each force value computed is saved to a different address in 
the GPU global memory. This allows one to avoid memory conflicts, but requires an additional 
gathering kernel for summing all the forces for a given particle obtained in the array F (in 
Algorithm 4): 

Algorithm 5: Gathering kernel for the force computation. 

1. z GPU thread index {same as particle index} 

2. fi^O {resulting force due to one potential energy term} 

3. Pi <^= Pp[i] {number of particle pairs for the i-th particle} 

4. for p = to Pj — 1 do 

5. df^F[^[p] 

6. fi^ h + df 

7. end for 

8. Output: fi 

Pi counts the residue pairs for the i-th particle (for one potential energy term). The total 
force for the z-th residue (/j) is calculated by summing over all the forces computed previously 
(-F[i][p], line 12—13). This part of the program can be incorporated into the integration kernel 
to minimize the number of computational kernels. On GPUs with the new Fermi architecture 
(from NVIDIA), the use of thread safe atomic addition of the computed force values to a specific 
location in the GPU global memory will help to remove the performance barriers associated with 
multiple memory calls. 
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Verlet lists: In the interacting pair based parallelization approach, generating a Verlet hst 
surmounts to forming the vector pairs of all residue pairs for one potential energy term. On a 
GPU, constructing this vector is a formidable task, since the exact position in the list, to which 
information about the next residue pair should be saved, is not known. One possibility is to 



use the atomicAdd( . . . ) routine from the CUDA Software Development Kit 32[, which allows 
one to add integers in the GPU global memory without running into memory conflicts even 
when many threads attempt to access the same memory address at the same time. However, 
when many threads run in parallel, identifying new pairs and saving them one after another 
may result in a Verlet list that is not ordered according to the particle index. This may result, 
in turn, in an inefficient utilization of the cache memory. To obtain an ordered Verlet list, it has 
to be sorted or updated on a CPU. It is more efficient to compute interparticle distances on a 
GPU, copy them to the CPU DRAM, and then generate a new list. 

C. The random force 

Langevin simulations require a reliable source of 3A^ normally distributed pseudorandom 
numbers, Qi^a {c(=x, y, and z) produced at each integration step, in order to compute the three 



components of a Gaussian random force Gj Q,=5'i.av2^^B^T^; where ^ is the friction coefficient, 
and h is the integration time step. A pseudorandom number generator (PRNG) produces a 
sequence of random numbers Ui^a uniformly distributed in the unit interval [0, 1]. This sequence, 
which imitates a sequence of independent and identically distributed (i.i.d.) random variables, 
is then translated into the sequence of normally distributed pseudorandom numbers with zero 



mean and unit variance {gi,a) using the Box- Mueller transformation [48|]. While there exist 
stand alone implementations of good quality PRNGs on a GPU, in Langevin simulations a 
PRNG should be incorporated into the integration kernel to minimize read/write calls of the 
GPU global memory. 

The simplest approach for constructing a PRNG on a GPU is to initiate an independent 
generator in each thread (one-PRNG-per-thread approach) so that pseudorandom numbers can 
be produced during the numerical integration of the Langevin equations. First, a CPU generates 

independent sets of random seeds for A^ PRNGs, and then transfers them to the GPU global 
memory. When AN i.i.d. pseudorandom numbers needed for generating 3A^ normally 

distributed random numbers Qi^a, each thread reads a corresponding set of random seeds to 
produce 4 normally distributed random numbers for each residue. Then, a PRNG updates 
its current state in the GPU global memory, which is used as an initial seed within the same 
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thread at the next time step. In a different appproach, random seeds for just one PRNG state 
can be shared among the computational threads on the entire GPU (one-PRNG-for-all-threads 
approach). Using both approaches, we have developed and tested GPU-based realizations of 
PRNG which are based on the Hybrid Taus, Ran2, Lagged Fibonacci and Mersenne Twister 
algorithms (manuscript in preparation). It has been shown that these algorithms pass a number 
of stringent statistical tests and produce pseudorandom numbers of very high statistical quality 



49|. 



Hybrid Taus algorithm: In this paper, we have implemented on a GPU the Hybrid Taus 



generator 



49| - Tausworthe algorithm combined with a Linear Congruential Generator (LCG) - 



employing the one-PRNG-per-thread approach. Hybrid Taus PRNG uses a small memory area 
since only four integers are needed to store its current state. Tausworthe taus88 is a fast equi- 
distributed modulo 2 generator 50|, |5l| , which produces pseudorandom numbers by generating 
a sequence of bits from a linear recurrence modulo 2, and forming the resulting number by 
taking a block of successive bits. In the space of binary vectors, the i-th element of a vector 
is constructed using the following transformation: yi=ai?/j_i+a2?/i_2+- • -dkyi-k, where a, are 
constant coeffitients. Given the initial values, yQ,yi, . . . yi-i, the i-th random integer is obtained 
as X i=Y2 J ^^y is-\- j - 12~ ^ , where s is a positive integer and L=32 is the machine word size. When 
Cik=ciq=aQ=l, where 0<2g< k, and aj=0 for 0<s<k—q<k<L, the algorithm can be simplified 



to a series of binary operations 



51| . Statistical qualities of pseudorandom numbers produced 



using the taus88 algorithm are high 49|] when taus88 is combined with the LCG. When the 
periods of several components of a generator are co-prime numbers, the period of a combined 
generator is the product of the periods of all the components. A similar approach is used in the 
ilSS algorithm, which combines the LCG, Tausworthe and two multiple- with-carry generators 



52l |. We used constant parameters which resulted in periods of 2'^^ — 1, 2'^'^ — 1, and 2^*^ — 1 
for three Tausworthe generators and a period of 2"^^ for the LCG generator; the period of the 
combined generator is ~2^^^>2x 10'^^. In the pseudocode below, superscripts h and d denote the 
host (CPU) and the device (GPU) memory, respectively; a section of the code, executed on the 
GPU, is in the same listing as the code for the CPU. In the CUDA program, the corresponding 
code for the GPU is organized into a separate kernel: 
Algorithm 6: Hybrid Taus PRNG. 

Require: yi[N], y2[N], yt}[N] and y!l[N] allocated in CPU memory 
Require: yf[N], y2[N], y^N] and yi[N] allocated in GPU global memory 
1. y^[l . . .N] to y!l[l . . .N] ^ Initial seeds. 
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2. ...N]toy!^[l...N]^yf[l...N]toyi[l...N] {copying initial seeds to GPU} 
Beginning of the GPU code 

3- jth <— thread index 

4. Z/i, 1/2, Vs and <= yf[jth], ytljth], yHjth] and yfljth] {loading the state of a generator} 

5. for « = 1 to 4 do {four pseudorandom numbers are to be generated} 

6. 6 ^ < cn) XOR yi) > C2i) 

7. 2/i^(((yiANDci)<C3i)XOR6 

8. (((^2 < C12) XOR y2) > C22) 

9. 2/2 ^ (((^2 AND C2) < C32) XOR 6 

10. b ^ (((2/3 < C13) XOR ^3) > C23) 

11. ys ^ {{{ys AND C3) < C33) XOR 6 

12. ^4 ^ a?/4 + c 

13. Output mw/t X ( XOR yi XOR 1/2 XOR y^ XOR 7/4) 

14. end for{generating the next random number} 

15. yi, 2/2, 2/3 and 2/4 ^ ?/f[jt/i], 2/2bf/i]) yiijth] and 2/f[jt/,] {saving the current state of the 
generator} 

End of the GPU code 

In this listing, 6 is a temporary unsigned integer variable, 2/1,2/2,2/3, and 2/4 are unsigned integer 
random seeds for three Tausworthe generators (lines 6 to 11) and one LCG (line 12), and XOR 
is a binary operation of exclusive disjunction; mM//:=2.3283064365387x 10^^*^ is a multiplier that 
converts a resulting integer into a floating point number (between and 1); Cii=13, C2i=19, 
C3i = 12, C2i=2, C22=25, C23=4, C3i=3, C32=ll, C33=17, Ci =4294967294, C2=4294967288, and 
C3=4294967280 are constant parameters for the Tausworthe generators SjJ, and a=1664525 
and c=1013904223 are constant parameters for the LCG f 



D. The numerical integration kernel 

On a GPU, the Langevin equations of motion can be solved simultaneously for all particles 
in A^ threads working in parallel. When the particle based parallelization is utilized, the subrou- 
tines for the force computation can be incorporated into the integration kernel. This allows a 
programmer to use coordinate variables, stored locally in the GPU global memory, that are read 
only once at the beginning of the computational procedure and are passed to the next subrou- 
tine. Since all the interactions are more or less local, texture cache can be used as well to access 
the coordinates in the GPU global memory. When the interacting pair based parallelization is 
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employed, the force computations can be performed in a separate kernel and the summation of 
all the forces (gathering kernel, Algorithm 5) can be done inside the integration kernel. Using 
one kernel for the force computation, the force summation and the numerical integration mini- 
mizes the number of kernel invocations on the CPU, thus, saving time for context switching on 
the GPU. 

Algorithm 7: Numerical integration of the Langevin equations of motion. 

1. i ^ GPU thread index {same as particle index} 

2. vecTi ^ ^^](^n) {reading coordinate of the i-th particle at the beginning of each step} 
3- ft ^ J2v fi,-" {total force exerted on the z-th particle due to several pair potentials} 

4. Qi {gxi9yi9z) {3D vector with 3 normally distributed random numbers} 

5. "Ti r j + fih/E, + gi\j2kTh/ {1-st order integration scheme} 

6. fi =^ r[i]{tn+i) {saving coordinates to global memory at the end of each step} 

Each thread computes the displacement vector for just one particle. Once particle coordinates 
are retrieved from the GPU global memory via texture reference (line 2), they are used in the 
computational procedures that follow (the total force is computed in line 3). The corresponding 
forces fi^y, where the index v is running over different potential energy terms, are computed 
using ether the particle based or the interacting pair based parallelization approach. When the 
former approach is used, the entire computational procedure from the force computation to the 
random force generation, and to the numerical integration can be organized into a single kernel. 
In the latter case, an additional gathering kernel is needed to compute the total force (line 3). 
Continuing, particles are shifted to their new positions (line 5), which are saved to the GPU 
global memory (line 6). Since these coordinates are used at the next time step tn+i? they have 
to be moved from the time layer r[z](t„+i) to the time layer r[i](t„) at the end of each iteration. 



IV. THE SOP-GPU PROGRAM FOR LANGEVIN SIMULATIONS OF PROTEINS 



A. The SOP model 



We employed the methodology for the CPU-based realization of Langevin dynamics to de- 
velop a CUDA program for biomolecular simulations fully implemented on a GPU. To describe 



the mo^ 
gram) 



ecular force field, we adapted the Self Organized Polymer (SOP) model (SOP-GPU pro- 



44j . Previous studies have shown that the SOP model describes well the mechanical 



properties of proteins, including the Green Fluorescent Protein 5^] and the tubulin dimer 55 1. 
In the SOP model, each residue is described using a single interaction center (C^-atom). The 
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potential energy function of a protein conformation V, specified in terms of the coordinates 
{r}=ri, r2, . . . , tat, is given by 

T/ _ T/ I T/ATT I T.^REP _ 



i=l 




(1) 



In Eq. ([1]), the finite extensible nonlinear elastic (FENE) potential Vfene describes the backbone 
chain connectivity. The distance between two next-neighbor residues i and z+1, is j+i, while 
r^ij^i is its value in the native (PDB) structure, and Rq=2A is the tolerance in the change of a 
covalent bond (first term in Eq. ([1])). We used the Lennard- Jones potential (V/^J-^) to account 



for the non-covalent interactions that stabilize the native state (second term in Eq. ([T])). We 
assumed that, if the noncovalently linked residues i and j {\i — j|>2) are within the cutoff 
distance Rc=8A, then Ajj=l, and zero otherwise. We used a uniform value for en=1.5kcal/mol, 
which quantifies the strength of the non-bonded interactions. All the non-native interactions 
in the V^§^ potential are described as repulsive (third term in Eq. ([T])). Additional constraint 
are imposed on the bond angle formed by residues i, and i+2 by including the repulsive 
potential with parameters er=lkcal/mol and (Tj^j+2=3.8A, which determine the strength and the 
range of the repulsion. To ensure the self-avoidance of the protein chain, we set (7=3. SA (last 
term in Eq. ([T])). 

B. Benchmark simulations 

We carried out test simulations of the mechanical unfolding for the all-/3-strand domain WW 
from the human Pinl protein (PDB code IPIN, Table I) using the SOP-GPU program. The 
rationale behind choosing this protein as a test system is two-fold. First, the WW domain is 
of particular interest to the field of protein folding and dynamics, and several research groups 
have expended considerable efforts to characterize the biophysical and biochemical properties 



of this protein [29|, |56|, |57|. Secondly, this is the smallest known independently folding all-/3 
domain and the all-/3 protein architecture is the primordial structural state that can be studied 
experimentally using single-molecule force spectroscopic techniques such as AFM, and laser and 
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optical tweezers (58|, |59|. For these reasons, the WW domain has been extensively used in the 
theoretical exploration of protein folding and unfolding. 

We consider the following principal sources of error: (1) precision issues arising from the 
differences in single precision (GPU) and double precision (CPU) IEEE floating point arithmetic, 
(2) possible read/write errors in the GPU global memory (hardware), and (3) accuracy of the 
SOP-GPU program, i.e. possible errors in the numerical routines (software). We report on our 
implementation of the SOP-GPU package on the NVIDIA GeForce GTX 295 (Section II) and 
compare it against a dual Quad Core Xeon 2.66GHz, considered to be representative of similar 
levels of technology. All CPU/GPU benchmarks have been obtained on a single GPU and a 
single CPU. To obtain the dynamics of the force- induced molecular elongation, the Langevin 
equations of motion for each residue have been integrated numerically using the first-order 



integration scheme (in powers of the integration time step h) 60 1 



r,{t + h) = r,(t) + f{r,{t))^h + G^t), (2) 

where Gi is the random force, and f{ri)=—{dV{ri)/dri) is the total force due to the covalent 
and the noncovalent interactions (Eq. ([1])) exerted on the i-th particle. Benchmark simulations 
of the mechanical unfolding of the WW domain have been carried out at room temperature 
{kBT=4.14pN/ nm) over 4x10^ iterations with the time step h=20ps, using the standard bulk 
water viscosity (.^=7.0 x lO^pNps/nm). Each trajectory has been generated by fixing the A^- 
terminal end and pulling the C-terminal end of the WW domain with the time-dependent 
mechanical force fext{t)=rft in the direction corresponding to the end-to-end vector, and us- 
ing the force-loading rate rf=K,uo, where K=35pN/nm is the cantilever spring constant and 
t'o=2.5/im/s is the pulling speed. 

The results of the CPU- and the CPU-based computations are presented in Fig. 3, where we 
compare the force-extension curves f{R) and the average temperature (T) for two representa- 
tive trajectories of unfolding, and the distributions of unfolding forces p{f*) sampled from 260 
trajectories on a CPU and 300 trajectories on a GPU. Temperature conservation {d{T)/dt), me- 
chanical work performed on the system (w= /^^'" f{R)dR), and the distribution of peak forces 
(/*'s) are rigorous physical metrics for measuring the precision of molecular simulations. Aside 
from small deviations due to the different initial conditions, the profiles of f{R) and (T), ob- 
tained on the CPU, are close to the profiles of the same quantities, generated on the GPU. A 
small drop in (T) is due to the onset of the unfolding transition in the WW domain, which occurs 
at t^O.lbms. Both the CPU- and the GPU-based calculations of the histogram of unfolding 
forces p{f*) result in similar values of the average force, i.e. {f*)^120.56pN on the CPU and 
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(/*)~120.94pA^ on the GPU, but slightly different standard deviations of af*^5.83pN on the 
CPU versus af*^6.58pN on the GPU due to the small sample size (Fig. 3). The magnitude of 



the critical force for unfolding is wel 



/3-strand single domain proteins 
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within the 60pA^— 200pA^ force range observed for mostly 



61|. 



C. Accuracy of the numerical integrators 

Langevin simulations fully implemented on a GPU enable one to obtain long trajectories of 
protein dynamics generated over as many as 10^—10^° iterations. Consequently, there emerges 
a question about the numerical accuracy of the integration scheme used. In Langevin simula- 
tions of proteins, the equations of motion are solved numerically using the first-order integrator 



(Eq. ([2])) 60|. However, the magnitude of the associated numerical error, which may, potentially, 
add up over many billions of iterations, is not known. We assessed the numerical accuracy of 
integration protocols by considering the mechanical unfolding of a protein. To test the results 
of simulations of the protein extension AX{t)=X(t)—Xo against the theoretical predictions, 
we used an exactly solvable model of a Brownian particle X{t) evolving in a one-dimensional 
harmonic potential, V(X)=K,^ JX — Xo)^/2, where Xq is the equilibrium position and Kgp is 



the molecular spring constant 



62|. 



Under the non-equilibrium conditions of the time-dependent force application, fext{t)='rft, 
the average particle position (the end-to-end distance), computed theoretically, is given by 

{X{t)U = Xoe^"^ + lll{t-T{l- e-*/^)) , (3) 

where T=S,/Ksp is the characteristic timescale. In pulling simulations, the average particle 
position at the step ra+l, (X(t„+i)), where tn=nh, can be obtained recursively from the average 
position obtained at the previous n-th step, (X(t„)), using the first-order integration scheme, 

(X(Wi))..™ = {X{Q) + - i^M^ h, (4) 



or the second-order integration scheme. 



Hence, Eq. (E]), (jlj), and ([5]) can be used to assess the accuracy of the numerical integrators. 

We carried out calculations of {AX(t)) at room temperature for n=10^ iterations using 
Xq=0 as initial condition, which corresponds to the initial molecular extension of AX(0)=0, 
Ksp=4:0pN/nm, which is within the 20— bOpN/nm range of values observed in the experimen- 
tal force-extension curves of proteins. We used the time steps of h=l, 25 and 50ps. We set 
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K=10pN/nm and vo=\^m/ which translates to rf=10~^pN/ns. These are typical values of a 
cantilever spring constant and a pulling speed used in AFM experiments. We set the diffusion 
constant to D=kBT/C,=1.5xlO~^^c'm'^ /s, which corresponds to the force-driven ~150?7,m exten- 
sion of the fibrinogen molecule observed in AFM experiments over time t=0.1s 21|. The slow 
force-driven "diffusion of the molecular extension", AX{t), described by the Brownian particle 
model with £)^10~^^cm^/s, should not be confused with the free Brownian diffusion of protein 
molecules in aqueous solution, for which D is in the 10~^—10~'^ cm^/s range. 

We found that the average extensions {AX{tn+i))sim, calculated using the first order (Eq. (jlj)) 
and the second order (Eq. ([5])) integrators, agree very well with each other and with the theo- 
retical curve of this quantity {AX{t))th (Eq. ([3])) for all values of h. The simulated data points 
of {AX{tn+i))sirn practically collapsed on the theoretical curve {AX(t))th in the entire range of 
time (data not shown). To quantify any reduction in accuracy, we estimated the relative error 
of the molecular extension, \{AX(tn^i))sim—{AX{t))th\/{AX{t))th, accumulated at the end of 
each trajectory and averaged over 10^ runs. The relative error was found to be less than 2x 10~^ 
(lxlO~^) for the first-order (second-order) scheme, significantly below the 10~^ level considered 
the acceptable maximum for relative error in biomolecular simulations. These results show that 
in a stochastic thermostat (random force) the numerical integration errors, associated with the 
calculation of the global mechanical reaction coordinate AX(t), are minimal and/or cancel out, 
and that single precision arithmetic is adequate for production runs. Hence, in the context 
of long simulations of biomolecules on a GPU, the first-order integrator (Ermak-McCammon 
algorithm) can be used to describe accurately their mechanical properties under physiologically 
relevant conditions of force application. 



D. Performance measurements 



We have compared the overall performance of an end-to-end application of the SOP-GPU 
program with the heavily tuned CPU-based implementation of the SOP model (SOP-CPU 
program) in describing the Langevin dynamics of the domain WW at equilibrium (Fig. 4, Table 
I). To fully occupy the GPU resources, we profiled the computational performance of the SOP- 
GPU program as a function of the number of independent trajectories running concurrently on 
a single GPU. We refer to this as the many-runs-per-GPU approach. Alternatively, we could 
have assessed the performance of the program by running one trajectory on a single GPU, but 
for a range of systems of different size A^, which we refer to as the one-run-per-GPU approach. 
The results obtained indicate that for a small system of 34 residues (W^VT domain. Table I), 
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the use of a single GPU device allows one to accelerate simulations starting from 3 independent 
runs (for small systems, there is insufficient parallelism to fully load the GPU), which is also 
equivalent to running one trajectory on a single GPU for a system of ~10^ residues, such as the 
domains Ig2'l and C2A (Table I). While the simulation time on the CPU scales linearly with 
the number of runs, the scaling in this regime on the GPU is sublinear (nearly constant) until 
the number of runs is ~100. At this point, depending on the number of threads per thread 
block, the GPU shows significant performance gains relative to the CPU reaching its maximum 
25— 30-fold value (the speedup is shown in the inset of Fig. 4). We ran the simulations long 
enough to converge the speedup ratio (n=10^ steps of size h=4Qps). Beyond this point, the 
GPU device is fully subscribed and the execution time scales linearly with the number of runs 
as on the CPU. 

In general, the total number of threads of execution Mth=TnBB, defined by the number of 
thread blocks of size B, is roughly equal the product of the system size and the number of 
trajectories s running concurrently on the GPU, i.e. Mth=Ns. Because it is impossible to predict 
which block size B will result in the best performance, we carried out benchmark computations 
for B=1Q, 64, and 256 for purposes of performance comparison. Our results indicate that all 
ALUs must to be fully loaded and that Mth should exceed the number of ALUs on a single 
GPU by a factor of 10—15. For example, on a graphics card with 240 ALUs (GeForce GTX 
280 or GTX 295, or Tesla C1060), Mj/,^3, 000-3, 500, and in the case of WW domain (A^=34) 
this translates to s~100 (Fig. 4). This implies that the use of small thread blocks is more 
advantageous when Mt/i<3, 000, e.g., when simulating one trajectory for a system of the size of 
fibrinogen monomer Fb (A^~2, 000) or 30 runs for a system with A^^lOO {Ig27), or ~50 runs 
for a smaller system such as the WW domain (A^=34). However, larger thread blocks should be 
used when Mth>3, 000 to simulate a few trajectories for larger systems such as the fibrinogen 
dimer {Fb)2 {N=3, 849) or to obtain one trajectory for a very large system, e.g., the viral capsid 
HK97 (A^=115, 140, see Table I). 

To profile the associated computational time and memory demand for the SOP-GPU software 
we ran a series of benchmark simulations on several different protein systems (Table I) using the 
one-run-per-GPU approach. The execution time of an end-to-end application of the program as 
a function of particle count was recorded using the standard CUDA runtime profiling tool. To 
more closely assess the performance characteristics of the SOP-GPU code as a function of the 
system size A^, we analyzed one simulation run for each protein system, generated over n=10^ 
steps of size h=AOps and block size of 5=64 threads (Fig. 5). For all test systems of less than 
~3, 000 residues the associated simulation time remains roughly the same, i.e. (1 — 3)xl0~^ 
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seconds per step. This is not surprising since Mth'^S, 000 is the amount of threads needed to 
fully utilize the GPU resources. For larger systems (A^>3,000), the computational time scales 
roughly linearly with (Fig. 5). Small variation in runtime versus around the monotonic 
linear dependence is due to the different native topology of the test systems, i.e. the number of 
native and non-native contacts in the PDB structure (Table I). The amount of on-board memory 
in contemporary graphics cards - r^lGB (GeForce GTX 200 series) and AGB (Tesla C1060) - 
is sufficient for Langevin simulations of large biomolecular systems comparable in size with the 
fibrinogen dimer {Fh)2- The amount of on-board memory will most likely increase with the next 
generation of graphics cards with new Fermi architecture from NVIDIA (up to ~6G-B) 63 1. 



V. CONCLUSION 



We have developed and tested, to the best of our knowledge, the first GPU-based implemen- 
tation of Langevin simulations of biomolecules, where the particle based and the interacting 
pair based approaches have been employed in the parallel computation of forces due to the 
covalent and non-covalent interactions governed by the standard pair potentials. We have pre- 
sented the numerical routines for the generation of pseudorandom numbers using the Hybrid 
Taus algorithm to describe random collisions of a biomolecule with solvent molecules, for the 
construction of Verlet lists, and for the numerical integration of Langevin equations of motion 
based on the first order scheme. Although we focused on the C^-based coarse-grained SOP 
model of a protein, which involves only two-body potentials in the potential energy function, 
the developed formalism can be used in conjunction with more sophisticated biomolecular force 
fields to explore, e.g., protein-protein and protein-DNA interactions, and it can also be extended 
to include the three-body (angle) potentials to describe side chains. In addition, the numerical 
integration kernel can be modified to follow the Langevin dynamics in the underdamped limit, 
in order to describe the thermodynamics of biomolecular transitions. 

The developed formalism has been mapped into a standard CUDA code for Langevin sim- 
ulations of biomolecules (SOP-GPU program). Benchmark simulations have shown that for 
a test system of the all-/3 WW domain the results of simulations of the mechanical denatu- 
ration on the GPU agree well with the results obtained on the CPU (Fig. 3). In a separate 
work, we also compared the forced unfolding data, obtained on the CPU and on the GPU, for 
the human synaptotagmin Sytl (manuscript under review) and for the human fibrinogen Fh 
molecules (manuscript in preparation), and found that the results of the CPU- and GPU-based 
computations agree very well. Using an exactly solvable model of a Brownian particle evolving 
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in a harmonic potential, we have assessed the accuracy of the numerical integration of Langevin 
equations of motion in describing the force-induced elongation of a protein chain. We found that 
using the first-order integrator is sufficient to accurately describe the force-driven elongation of 
a protein over many billions of iterations. 

GPUs can be utilized to generate a few trajectories of Langevin dynamics for a large system of 
many thousands of residues (one-run-per-GPU approach) or many trajectories for a small system 
composed of a few hundreds of amino acids (many-runs-per-GPU approach). Using the one- 
run-per-GPU approach, we analyzed the relative CPU/GPU performance of the program, and 
found that the GPU-based realization leads to a substantial ~25— 30 computational speedup, 
which depends on the number of threads in a thread block (similar acceleration can also be 
achieved using the one-run-per-GPU approach). We profiled the SOP-GPU program in terms 
of the computational time and the memory usage for a range of proteins (Table I). Our results 
show that the simulation time on a GPU remains roughly constant for a system of A^f^lO^ — 10^ 
residues, and scales linearly with N for a system of A^>10'^ residues. The GPU on-board mem- 
ory in contemporary graphics cards (GeForce 280, GeForce 295, and Tesla C1060) is sufficient 
to describe Langevin dynamics for large systems involving as many as ~10^ residues (Fig. 5). 
Describing mechanism(s) and multiple pathways underlying biomolecular transitions and resolv- 
ing the entire distributions of the molecular characteristics requires gathering of the statistically 
significant amount of data. This can be achieved on a single GPU using the many-runs-per-GPU 
approach for smaller systems, and on multiple GPUs employing the one-run-per-GPU approach 
for larger systems. The many-runs-per-GPU approach can also be utilized in conjunction with 
parallel tempering algorithms, including variants of the Replica Exchange Method, to resolve 
the phase diagrams of biomolecules. 

The results obtained attest to the accuracy of the SOP-GPU program. The SOP-GPU 
software can now be utilized to describe the mechanical properties of proteins, the strength of 
the noncovalent bonds that stabilize protein-protein complexes and aggregates, and the physical 
properties of large-size protein assemblies. A combination of the SOP model and the GPU-based 
computations enables one to carry out molecular simulations in reasonable wall-clock time in 
order to explore the unfolding micromechanics of protein fibers and visco-elastic properties of 
biomolecular assemblies using the experimental protocol of force application. This allows one to 
interpret the experimental force-extension curves and force-indentation profiles of biomolecules, 
obtained in dynamic force spectroscopy assays, thus, bridging the gap between theory and 
experiments. For example, on a GPU GeForce GTX 280 or GTX 295 it takes only ~7 days to 
generate a single unfolding trajectory for the fibrinogen monomer Fb using experimental pulling 
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speed of 2.5/im/s. By contrast, it would take as long as ~8 months to obtain just one trajectory 
using the CPU version of the program on a 2.Q6GHz Intel Core 17 with 6GB of memory. It takes 
~35 days to generate one force-indentation curve for the viral capsid HK97 on a single GPU 
(GeForce GTX 200 series, Tesla C1060) using experimental pulling speed of lOfim/s. Hence, 
many force-indentation trajectories can be generated on a single desktop computer equiped with 
several CPUs. 

Beyond that, we note that due to rapid evolution of GPU hardware, the simulation time on 
the GPU will decrease significantly with the introduction of the MIMD (Multiple Instruction 
Multiple Data) based Fermi architecture (NVIDIA) in the near future 63|. The CUDA program, 
tested in this work, should be able to run with minor modifications on future hardware making 
use of increased processing resources, and will permit the theoretical exploration of a range 
of interesting problems for which the experimental data are already available. The presented 
formalism can also be applied to the studies of other systems, including molecular motors, 
coUoidosomes and liposomes. 
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FIGURE CAPTIONS 

Fig. 1. Device arcliitecture of a CPU (left panel) and a GPU (right panel). On a CPU, a 
significant amount of the chip surface provides cache and flow control for all ALUs. A CPU 
is capable of storing large amount of data on the DRAM (Dynamic Random Access Memory). 
Most of the GPU chip surface is devoted to computational units, and a graphics card has on- 
board global memory. On the GPU, ALUs are grouped into multiprocessors, each of which has 
its own small flow control and cache units. To execute a simulation protocol on the GPU device, 
the CPU and the GPU communicate via the PCI Express bus, which allows the CPU to execute 
computational kernels on the GPU and to access GPU global memory. 

Fig. 2. The numerical algorithm and computational procedures for the Langevin simulations 
of biomolecules on a GPU. The computational workflow is shown using black arrows, and data 
transfer - read and write operations from and to the DRAM and the HDD (hard drive) on the 
CPU device and the GPU global memory - are represented by the dashed arrows. Designing 
CUDA kernels involves the decomposition of work into small fragments that can be mapped into 
thread blocks, and further decomposition into warps and into independent threads of execution. 
The computational workflow for only one (z-th) thread running on the GPU and the workload 
division between CPU and GPU are shown in detail. The execution of the program is initiated on 
the CPU, which is used to prepare and store the initial and output data. The CPU device starts 
the launch of each computational kernel on the GPU for the calculation of forces, generation of 
the random forces and Verlet lists, and for the numerical integration of the Langevin equations 
of motion, using many independent threads running in parallel. 

Fig. 3. Comparison of the results of pulling simulations for the all /3 lypF-domain (Table I) 
obtained on a CPU and on a GPU (the color code is explained in the graphs). Panel (a): 
Representative examples of the dependence of mechanical tension experienced by the protein 
chain / as a function of the molecular extension R (force-extension curves) obtained using 
running averages over 500 data points. Panel (6): The histogram based estimates of the distri- 
bution of unfolding forces p{f*), i.e., peak forces /* extracted from the force-extension curves. 
The histograms have been constructed using the bandwidth (bin size) of hf*^3.6pN. Panel (c): 
Representative examples of the time dependence of the average temperature of the protein chain 
(T(t)) (in units of /csT), which correspond to the force-extension curves (panel (a)), obtained 
using running averages over 500 data points. 

Fig. 4. The many-runs-per-GPU approach: shown are the simulation time on the CPU and on 
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the GPU on a log- log scale and the relative CPU/GPU performance (computational speedup) 
of the end-to-end application of the SOP-GPU program (the inset) as a function of the number 
of equilibrium simulation runs for the all-/3-strand WW domain. While a single CPU core 
generates only one trajectory at a time, a GPU device is capable of running many independent 
trajectories at the same time. The relative CPU/GPU performance is tested for the thread 
block size of 5=16, 64, and 256 computational threads of execution. 

Fig. 5. The one-run-per-GPU approach: shown are the simulation time on the GPU on a log- log 
scale and the memory usage (the inset) on a GPU as a function of the system size (number of 
residues) N. Results are presented for several test systems, including small proteins (the WW- 
domain, the Ig27 domain from human titin, the domain C2A from human synaptotagmin Sytl), 
large proteins fragments (the 7C chain and the double-D fragment from human fibrinogen Fb), 
long protein fibers (the Fb monomer and dimer) , and large-size protein assembly (the viral capsid 
HK97). The information about the native state topology for these biomolecules is summarized 
in Table I. 
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TABLE I: Number of residues, covalent bonds, native contacts stabilizing the folded state, and 
residue pairs for a range of proteins (VFVF-domain, Ig27, C2j4-domain, 7C and j3C chains), protein 
fibers {Fh monomer and dimer) and protein assembly (viral capsid HK97) used in the benchmark 
simulations. 



Protein 


WW 


1^27^ 


CIA" 




D-D" 


Fbf 




HK97 ^ 


PDB code 


IPIN 


ITIT 


2R83' 


iMiy 


IFZB 


3GHG 


3GHG 


IFTl'^ 


Residues 


34 


89 


126 


517 


1,062 


1,913 


3,849 


115,140 


Covalent bonds 


33 


88 


125 


521 


1,072 


1,932 


3,839 


114,720 


Native contacts 


65 


255 


328 


1,770 


3,498 


5,709 


12,560 


467,904 


Non-native pairs 


463 


3,573 


7,422 


131,101 


558,833 


1,821,212 


7,389,077 


16,178,028' 



'^All-/?-strand PFTF-domain. 
^Ig21 domain of human titin. 
'^C2A-domain from human synaptotagmin Sytl. 
'^'yC and /3C domains from human fibrinogen Fb. 
'^Double-D fragment (D—D interface) of human fibrinogen Fb. 
Human fibrinogen monomer Fb. 

^Human fibrinogen dimer {Fb)2 created from two Fb monomers (3GHG) and the D—D interface 
(IFZB). 

^HK97 is Head II viral capsid. 
^C2A domain of human Sytl protein. 

^^C and /3C chains in human Fb starting fom the CYS ring in the D-domain. 



'^PDB code for a structural unit; the full HK97 capsid structure can be found in the Viper 
database. 
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Based on a cut-off distance of 200A. 
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